Journal of Open Source Software

The Open Journal

All preprints, ranked by how well they match Journal of Open Source Software's content profile, based on 22 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Distances and their visualization in studies of spatial-temporal genetic variation using single nucleotide polymorphisms (SNPs)

Georges, A.; Mijangos, L.; Patel, H. R.; Aitkens, M.; Gruber, B. R.

2023-05-11 bioinformatics 10.1101/2023.03.22.533737 medRxiv
Top 0.1%
23.0%

Distance measures are widely used for examining genetic structure in datasets that comprise many individuals scored for a very large number of attributes. Genotype datasets composed of single nucleotide polymorphisms (SNPs) typically contain bi-allelic scores for tens of thousands if not hundreds of thousands of loci. We examine the application of distance measures to SNP genotypes and sequence tag presence-absences (SilicoDArT) and use real datasets and simulated data to illustrate pitfalls in the application of genetic distances and their visualization. Euclidean Distance is the metric of choice in many distance studies. However, other measures may be preferable because of their underlying models of divergence, population demographic history and linkage disequilibrium, because it is desirable to down-weight joint absences, or because of other characteristics specific to the data or analyses. Distance measures for SNP genotype data that depend on the arbitrary choice of reference and alternate alleles (e.g. Bray-Curtis distance) should not be used. Careful consideration should be given to which state is scored zero when applying binary distance measures to sequence tag presence-absences (e.g. Jaccard distance). Missing values that arise in the SNP discovery process can cause displacement of affected individuals from their natural groupings and artificial inflation of confidence envelopes, leading to potential misinterpretation. Filtering on missing values then imputing those that remain avoids distortion in visual representations. Failure of a distance measure to conform to metric and Euclidean properties is important but only likely to create unacceptable outcomes in extreme cases. Lack of randomness in the selection of individuals (e.g. inclusion of sibs) and lack of independence of both individuals and loci (e.g. polymorphic haploblocks), can lead to substantial and otherwise inexplicable distortions of the visual representations and again, potential misinterpretation.
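
The abstract's warning about measures that depend on the choice of reference allele can be illustrated with a small sketch. It assumes genotypes are coded as counts of the alternate allele (0, 1, 2), so swapping reference and alternate at a locus maps each score g to 2 - g. The two individuals and the flipped loci are made-up values, not data from the paper.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two genotype vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum |x - y| divided by sum (x + y)."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator

def flip(genotype, loci):
    """Swap reference and alternate allele at the given loci (g -> 2 - g)."""
    return [2 - g if i in loci else g for i, g in enumerate(genotype)]

# Hypothetical genotypes for two individuals at five bi-allelic loci.
ind_a = [0, 1, 2, 0, 2]
ind_b = [0, 2, 2, 1, 0]
loci_to_flip = {0, 3}  # arbitrary loci whose allele labels we swap

a_flipped = flip(ind_a, loci_to_flip)
b_flipped = flip(ind_b, loci_to_flip)

euclid_before = euclidean(ind_a, ind_b)
euclid_after = euclidean(a_flipped, b_flipped)
bc_before = bray_curtis(ind_a, ind_b)
bc_after = bray_curtis(a_flipped, b_flipped)

print(euclid_before, euclid_after)  # identical: differences only change sign
print(bc_before, bc_after)          # differ: the denominator depends on the coding
```

Relabelling alleles leaves every per-locus difference unchanged in magnitude, so Euclidean distance is unaffected, whereas the Bray-Curtis denominator sums the raw scores and therefore changes with the coding.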

2
TRaP: An Open-source, Reproducible Framework for Raman Spectral Preprocessing across Heterogeneous Systems

Zhu, Y.; Lionts, M. M.; Haugen, E.; Walter, A. B.; Voss, T. R.; Grow, G. R.; Liao, R.; McKee, M. E.; Locke, A.; Hiremath, G.; Mahadevan-Jansen, A.; Huo, Y.

2026-03-27 bioengineering 10.64898/2026.03.26.714582 medRxiv
Top 0.1%
17.4%

Raman spectroscopy offers a uniquely rich window into molecular structure and composition, making it a powerful tool across fields ranging from materials science to biology. However, the reproducibility of Raman data analysis remains a fundamental bottleneck. In practice, transforming raw spectra into meaningful results is far from standardized: workflows are often complex, fragmented, and implemented through highly customized, case-specific code. This challenge is compounded by the lack of unified open-source pipelines and the diversity of acquisition systems, each introducing its own file formats, calibration schemes, and correction requirements. Consequently, researchers must frequently rely on manual, ad hoc reconciliation of processing steps. To address this gap, we introduce TRaP (Toolbox for Reproducible Raman Processing), an open-source, GUI-based Python toolkit designed to bring reproducibility, transparency, and portability to Raman spectral analysis. TRaP unifies the entire preprocessing-to-analysis pipeline within a single, coherent framework that operates consistently across heterogeneous instrument platforms (e.g., Cart, Portable, Renishaw, and MANTIS). Central to its design is the concept of fully shareable, declarative workflows: users can encode complete processing pipelines into a single configuration file (e.g., JSON), enabling others to reproduce results instantly without reimplementing code or reverse-engineering undocumented steps. Beyond convenience, TRaP integrates configuration management, X-axis calibration, spectral response correction, interactive processing, and batch execution into a workflow-driven architecture that enforces deterministic, repeatable operations. Every transformation is explicitly recorded, making the full processing history transparent, inspectable, and reproducible. 
This eliminates ambiguity in how results are generated and ensures that identical protocols can be applied consistently across datasets and experimental contexts. Through representative use cases, we show that TRaP enables seamless, reproducible preprocessing of Raman spectra acquired from diverse platforms within a unified environment. We hope TRaP can empower Raman data processing as a reproducible, shareable, and systematized scientific practice, aligning it with modern standards for computational research. TRaP is released as open-source software at https://github.com/hrlblab/TRaP
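
The idea of a declarative, shareable workflow with a recorded processing history can be sketched in a few lines. This is not TRaP's configuration schema or API: the step names (`crop`, `subtract_minimum`, `scale_to_max`), the JSON layout, and the `run_workflow` helper are all hypothetical, chosen only to show the pattern of driving a pipeline from a single JSON document.

```python
import json

# Hypothetical processing steps; real Raman preprocessing is more involved.
def crop(spectrum, start, stop):
    """Keep only the samples in the index range [start, stop)."""
    return spectrum[start:stop]

def subtract_minimum(spectrum):
    """Crude baseline removal: shift the spectrum so its minimum is zero."""
    lowest = min(spectrum)
    return [value - lowest for value in spectrum]

def scale_to_max(spectrum):
    """Normalize so that the largest intensity equals 1."""
    peak = max(spectrum)
    return [value / peak for value in spectrum]

STEPS = {
    "crop": crop,
    "subtract_minimum": subtract_minimum,
    "scale_to_max": scale_to_max,
}

def run_workflow(spectrum, config_text):
    """Apply the steps listed in a JSON config, in order, and log each one."""
    config = json.loads(config_text)
    history = []
    for step in config["steps"]:
        params = step.get("params", {})
        spectrum = STEPS[step["name"]](spectrum, **params)
        history.append({"name": step["name"], "params": params})
    return spectrum, history

# Hypothetical configuration file content.
CONFIG = """
{"steps": [
  {"name": "crop", "params": {"start": 1, "stop": 5}},
  {"name": "subtract_minimum"},
  {"name": "scale_to_max"}
]}
"""

raw = [9.0, 2.0, 4.0, 6.0, 3.0, 8.0]
processed, history = run_workflow(raw, CONFIG)
print(processed)
print(history)
```

Because the whole pipeline lives in the configuration text, sharing that text is enough for someone else to rerun the same sequence of operations on their own spectra.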

3
PAV-spotter: using signal cross-correlations to identify Presence/Absence Variation in target capture data

de Visser, M. C.; Ploeg, C. v. d.; Cvijanovic, M.; Vucic, T.; Theodoropoulos, A.; Wielstra, B.

2025-01-24 bioengineering 10.1101/2024.10.25.620064 medRxiv
Top 0.1%
16.8%

High throughput sequencing technologies have become essential in the fields of evolutionary biology and genomics. When dealing with non-model organisms or genomic gigantism, sequencing whole genomes is still relatively costly and therefore reduced-genome representations are frequently obtained, for instance by target capture approaches. While computational tools exist that can handle target capture data and identify small-scale variants such as single nucleotide polymorphisms and micro-indels, options to identify large scale structural variants are limited. To meet this need, we introduce PAV-spotter: a tool that can identify presence/absence variation (PAV) in target capture data. PAV-spotter conducts a signal cross-correlation calculation, in which the distribution of read counts per target between samples of different a priori defined classes - e.g. male versus female, or diseased versus healthy - are compared. We apply and test our methodology by studying Triturus newts: salamanders with gigantic genomes that currently lack an annotated reference genome. Triturus newts suffer from a hereditary disease that kills half their offspring during embryogenesis. We compare the target capture data of two different types of diseased embryos, characterized by unique deletions, with those of healthy embryos. Our findings show that PAV-spotter helps to expose such structural variants, even in the face of medium to low sequencing coverage levels, low sample sizes, and background noise due to mis-mapped reads. PAV-spotter can be used to study the structural variation underlying supergene systems in the absence of whole genome assemblies. The code, including further explanation, is available through the PAV-spotter GitHub repository: https://github.com/Wielstra-Lab/PAVspotter.
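
A rough sketch of comparing read-count signals between two classes with a cross-correlation follows. It is not PAV-spotter's implementation: the per-target coverage profiles, the normalization, and the 0.5 cut-off are assumptions made for illustration. A target that is covered in one class but absent in the other yields a low score.

```python
from math import sqrt

def normalized_cross_correlation(x, y):
    """Peak cross-correlation of two equal-length, non-negative signals
    over all lags, scaled by the product of their Euclidean norms.
    Returns 0.0 when either signal is all zeros."""
    norm = sqrt(sum(v * v for v in x)) * sqrt(sum(v * v for v in y))
    if norm == 0:
        return 0.0
    n = len(x)
    best = 0.0
    for lag in range(-(n - 1), n):
        total = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                total += x[i] * y[j]
        best = max(best, total)
    return best / norm

# Hypothetical read-count profiles per target: (healthy class, diseased class).
targets = {
    "target_1": ([0, 5, 12, 20, 12, 5, 0], [0, 4, 11, 19, 13, 6, 0]),
    "target_2": ([0, 6, 14, 18, 13, 4, 0], [0, 0, 0, 0, 0, 0, 0]),
}

THRESHOLD = 0.5  # arbitrary cut-off for this illustration

scores = {
    name: normalized_cross_correlation(healthy, diseased)
    for name, (healthy, diseased) in targets.items()
}
# Targets whose signals barely correlate are candidate presence/absence variants.
flagged = [name for name, score in scores.items() if score < THRESHOLD]
print(scores)
print(flagged)
```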

4
CelFDrive: Artificial Intelligence assisted microscopy for automated detection of rare events

Brooks, S.; Toral-Perez, S.; Corcoran, D. S.; Kilborn, K.; Bodensteiner, B.; Baumann, H.; Burroughs, N. J.; McAinsh, A. D.; Bretschneider, T.

2024-10-19 bioengineering 10.1101/2024.10.17.618897 medRxiv
Top 0.1%
14.8%

1.1 Summary: CelFDrive automates high-resolution 3D imaging of cells of interest across a variety of fluorescence microscopes, integrating deep learning cell classification from auxiliary low resolution widefield images. CelFDrive enables efficient detection of rare events in large cell populations, such as the onset of cell division, and subsequent rapid switching to 3D imaging modes, increasing the speed for finding cells of interest by an order of magnitude. 1.2 Availability and Implementation: CelFDrive is freely available for academic purposes at the CelFDrive GitHub repository and can be installed on Windows, macOS or Linux-based machines with relevant conda environments [1]. Interacting with microscopy hardware requires additional software; we use SlideBook software from Intelligent Imaging Innovations (3i), but CelFDrive can be deployed with any microscope control software that can interact with a Python environment. Graphical Processing Units (GPUs) are recommended to increase the speed of application but are not required. On 3i systems the software can be deployed with a range of microscopes including their Lattice LightSheet microscope (LLSM) and spinning disk confocal (SDC). 1.3 Contact: s.brooks.2@warwick.ac.uk

5
Mercator: An R Package for Visualization of Distance Matrices

Abrams, Z. B.; Coombes, C. E.; Li, S.; Coombes, K. R.

2019-08-15 bioinformatics 10.1101/733261 medRxiv
Top 0.1%
12.2%

Summary: Unsupervised data analysis in many scientific disciplines is based on calculating distances between observations and finding ways to visualize those distances. These kinds of unsupervised analyses help researchers uncover patterns in large-scale data sets. However, researchers can select from a vast number of different distance metrics, each designed to highlight different aspects of different data types. There are also numerous visualization methods with their own strengths and weaknesses. To help researchers perform unsupervised analyses, we developed the Mercator R package. Mercator enables users to see important patterns in their data by generating multiple visualizations using different standard algorithms, making it particularly easy to compare and contrast the results arising from different metrics. By allowing users to select the distance metric that best fits their needs, Mercator helps researchers perform unsupervised analyses that use pattern identification through computation and visual inspection. Availability and Implementation: Mercator is freely available at the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/Mercator/index.html). Contact: Kevin.Coombes@osumc.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

6
CalciumInsights: An Open-Source, Tissue-Agnostic Graphical Interface for High-Quality Analysis of Calcium Signals

Gomez, D. S.; Rosas, N. C. P.; Contreras, G. I. M.; Brana, S. R. C.; Zhang, W.; Mim, M. S.; Tan, S. G.; Gazzo, D.; Tepole, A. B.; Deng, Q.; Reeves, G. T.; Isaza, C. E.; Staiger, C. J.; Umulis, D. M.; Zartman, J. J.; Rios, M. C.

2025-06-08 bioengineering 10.1101/2025.06.04.657923 medRxiv
Top 0.1%
12.1%

Fluctuations and propagation of cytosolic calcium levels at both the cellular and tissue levels show complex patterns, referred to as calcium signatures, that regulate growth, organ development, damage responses, and survival. The quantitative analysis of calcium signatures at the cellular level is essential for identifying unique patterns that coordinate biological processes. However, a versatile framework applicable to multiple tissue types, allowing researchers to compare, measure, and validate diverse responses and recognize conserved patterns across model organisms, is missing. Here, we present a post-processing tool, CalciumInsights, which leverages the R packages Shiny and Golem. This tool has a graphical user interface and does not require software programming experience to perform calcium signal analysis. The open-source software has a modular framework with standardized functionalities that can be tailored for various research approaches. CalciumInsights provides descriptive statistical analysis through various metrics extracted from dynamic calcium transients and oscillations, such as peak amplitude, area under the curve, frequency, among others. The tool was evaluated with fluorescence imaging data from three model organisms: Danio rerio, Arabidopsis thaliana, and Drosophila melanogaster, demonstrating its ability to analyze diverse biological responses and models. Finally, the open-source nature of CalciumInsights enables community-driven improvements and developments for enabling new applications. Author Summary: This manuscript introduces CalciumInsights, an open-source tool for calcium signature analysis. Designed to be a versatile tool that works with various tissue types and biological systems, CalciumInsights has an easy-to-use graphical user interface. Our program simplifies metrics extraction while maintaining the quality of the analysis by integrating several algorithms. 
CalciumInsights stands out for its user-friendliness, ease of use, and robust data exploration features, such as tunable filters for improved accuracy. These features promote inclusivity and lower barriers to scientific research by making calcium signature analysis accessible to users of all programming skill levels.

7
genomalicious: serving up a smorgasbord of R functions for population genomic analyses

Thia, J. A.; Riginos, C.

2019-07-05 bioinformatics 10.1101/667337 medRxiv
Top 0.1%
8.8%

Turning SNP data into biologically meaningful results requires considerable computational acrobatics, including importing, exporting, and manipulating data among different analytical packages and programming environments, and finding ways to visualise results for data exploration and presentation. I present GENOMALICIOUS, an R package designed to provide a selection of functions for population genomicists to simply, intuitively, and flexibly, guide SNP data through their analytical pipelines, within and outside R. At the core of the original GENOMALICIOUS workflow is the conversion of genomic variant data into a data.table object. This provides a useful way of storing large amounts of data in an intuitive format that can be easily manipulated using methods unique to this object class. Over time, GENOMALICIOUS has grown to cater to a range of analyses in population structure and demography, adaptive evolution, quantitative traits, and phylogenetics. Researchers using pooled allele frequencies, or individually sequenced genotypes, are sure to find functions that accommodate their tastes in GENOMALICIOUS. The simplicity and accessibility of pipelines in GENOMALICIOUS may also serve as a useful tool for teaching basic population genetics and genomics in an R environment. The source code and a series of tutorials for this package are freely available on GitHub

8
Multi Locus View: An Extensible Web Based Tool for the Analysis of Genomic Data

Sergeant, M.; Hughes, J. R.; Hentges, L.; Lunter, G.; Downes, D.; Taylor, S.

2020-10-01 bioinformatics 10.1101/2020.06.15.151837 medRxiv
Top 0.1%
6.7%

Motivation: Tracking and understanding data quality, analysis and reproducibility are critical concerns in the biological sciences. This is especially true in genomics where Next Generation Sequencing (NGS) based technologies such as ChIP-seq, RNA-seq and ATAC-seq are generating a flood of genome-scale data. These data-types are extremely high level and complex with single experiments capable of mapping ten to hundreds of thousands of biologically meaningful events across the genome. However, such data are usually processed with automated tools and pipelines, generating tabular outputs and static visualizations. These are difficult to interact with and require substantial bioinformatic skills to manipulate and query. Similarly, interpretation is normally made at a high level without the ability to visualise the underlying data in detail and so the complexity and quality of the real underlying biological signal is lost. Also genomics datasets require integration with other genomics datasets to be properly interpreted and this integration with multiple tracks again requires substantial bioinformatics skills and is difficult to visualise across multiple pertinent datasets. Conventional genome browsers do allow for the detailed visualisation of multiple tracks but are limited to browsing single locations and do not allow for interactions with the dataset as a whole. MLV has been developed to allow users to fluidly interact with genomics datasets at multiple scales, from complete metadata labelled and clustered populations to detailed representations of individual elements. It has inbuilt tools to integrate signals across multiple datasets and to perform dimensionality reduction and clustering analysis based on the extracted signal, allowing for the high-level analysis of complex datasets while maintaining visualisation of the fine grain structure of the data. 
MLV's ability to visualise clustering within the data combined with efficient tools for large-scale tagging of individual elements makes it a unique tool for the generation of annotated datasets for modern machine learning approaches. Results: Multi Locus View (MLV) is a web based tool for the visualisation, analysis and annotation of Next Generation Sequencing data sets. The user is able to browse the raw data, cluster, and combine the data with other analysis. Intuitive filtering and visualisation then enables the user to quickly locate and annotate regions of interest. User datasets can then be shared with other users or made public for quick assessment from the academic community. MLV is publicly available at https://mlv.molbiol.ox.ac.uk and the source code is available at https://github.com/Hughes-Genome-Group/mlv

9
CommDivMap: Modelling and mapping species richness at different spatial scales

Miller, J. E.; Steinke, D.

2020-05-13 bioinformatics 10.1101/2020.05.11.089029 medRxiv
Top 0.1%
6.4%

1. Modern ecosystem models have the potential to greatly enhance our capacity to predict community responses to change, but they demand comprehensive spatial distribution information, creating the need for new approaches to gather and synthesize biodiversity data. 2. Metabarcoding or metagenomics can generate comprehensive biodiversity data sets at species-level resolution but they are limited to point samples. 3. CommDivMap contains a number of functions that can be used to turn OTU tables resulting from metabarcoding runs of bulk samples into species richness maps. We tested the method on a series of arthropod bulk samples obtained from various experimental agricultural plots. 4. The script runs smoothly and is reasonably fast. We hope that our assemble first, predict later approach to statistical modelling of species richness will set the stage for the transition from data-rich but finite sets of point samples to spatially continuous biodiversity maps.
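
The first step implied by the abstract, turning an OTU table into per-sample species richness, can be sketched as below. This is not CommDivMap code: the table layout (samples as keys, OTU read counts as values), the sample names, and the `min_reads` parameter are assumptions for illustration, and the spatial modelling that produces a map is not shown.

```python
def species_richness(otu_table, min_reads=1):
    """Count, for each sample, the OTUs with at least min_reads reads."""
    return {
        sample: sum(1 for reads in counts.values() if reads >= min_reads)
        for sample, counts in otu_table.items()
    }

# Hypothetical OTU table: read counts per OTU for three bulk samples.
otu_table = {
    "plot_A": {"OTU_1": 120, "OTU_2": 0, "OTU_3": 15, "OTU_4": 3},
    "plot_B": {"OTU_1": 0, "OTU_2": 0, "OTU_3": 40, "OTU_4": 0},
    "plot_C": {"OTU_1": 8, "OTU_2": 22, "OTU_3": 5, "OTU_4": 1},
}

richness = species_richness(otu_table)
# A stricter read threshold drops low-abundance OTUs from the count.
richness_strict = species_richness(otu_table, min_reads=5)
print(richness)
print(richness_strict)
```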

10
anndata: Annotated data

Virshup, I.; Rybakov, S.; Theis, F. J.; Angerer, P.; Wolf, F. A.

2021-12-19 bioinformatics 10.1101/2021.12.16.473007 medRxiv
Top 0.1%
6.4%

anndata is a Python package for handling annotated data matrices in memory and on disk (github.com/theislab/anndata), positioned between pandas and xarray. anndata offers a broad range of computationally efficient features including, among others, sparse data support, lazy operations, and a PyTorch interface. Statement of need: Generating insight from high-dimensional data matrices typically works through training models that annotate observations and variables via low-dimensional representations. In exploratory data analysis, this involves iterative training and analysis using original and learned annotations and task-associated representations. anndata offers a canonical data structure for book-keeping these, which is neither addressed by pandas (McKinney, 2010), nor xarray (Hoyer & Hamman, 2017), nor commonly-used modeling packages like scikit-learn (Pedregosa et al., 2011).

11
Graphia: A platform for the graph-based visualisation and analysis of complex data

Freeman, T.; Horsewell, S.; Patir, A.; Harling-Lee, J.; Regan, T.; Shih, B. B.; Prendergast, J.; Hume, D. A.; Angus, T.

2020-09-03 bioinformatics 10.1101/2020.09.02.279349 medRxiv
Top 0.1%
6.4%

Quantitative and qualitative data derived from the analysis of genomes, genes, proteins or metabolites from tissue or cells are currently generated in huge volumes during biomedical research. Graphia is an open-source platform created for the graph-based analysis of such complex data, e.g. transcriptomics, proteomics, genomics data. The software imports data already defined as a network or a similarity matrix and is designed to rapidly visualise very large graphs in 2D or 3D space, providing a wide range of functionality for graph exploration. An extensive range of analysis algorithms, routines for graph transformation, and options for the visualisation of node and edge attributes are also available. Graphia's core is extensible through the deployment of plugins, supporting rapid development of additional computational analyses and features necessary for a given analysis task or data source. A plugin for correlation network analysis is distributed with the core application, to support the generation of correlation graphs from any tabular matrix of continuous or discrete values. This provides a powerful analysis solution for the interpretation of high-dimensional data from many sources. Several use cases of Graphia are described, to showcase its wide range of applications. Graphia runs on all major desktop operating systems and is freely available to download from https://graphia.app/.
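
Building a correlation graph from a tabular matrix, as mentioned in the abstract, can be sketched as follows. This is a generic illustration and not Graphia's plugin: it assumes Pearson correlation between rows, a fixed threshold of 0.9, and made-up expression values, and it requires every row to have non-zero variance.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric rows."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

def correlation_graph(rows, threshold):
    """Return edges (node_a, node_b, r) for row pairs with r >= threshold."""
    edges = []
    for (name_a, values_a), (name_b, values_b) in combinations(rows.items(), 2):
        r = pearson(values_a, values_b)
        if r >= threshold:
            edges.append((name_a, name_b, r))
    return edges

# Hypothetical expression values for three genes across four samples.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.1, 5.9, 8.0],
    "gene_c": [4.0, 3.0, 2.0, 1.0],
}

edges = correlation_graph(expression, threshold=0.9)
print(edges)  # only the positively correlated pair becomes an edge
```

Raising or lowering the threshold trades graph density against noise; anti-correlated rows such as gene_c are excluded here because only positive correlations above the cut-off form edges.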

12
MooViE - Engine for single-view visual analysis of multivariate data

Stratmann, A.; Beyss, M.; Jadebeck, J. F.; Nöh, K.

2024-04-29 bioinformatics 10.1101/2024.04.26.591357 medRxiv
Top 0.1%
6.3%

Summary: Understanding input-output relationships within multivariate datasets is a ubiquitous task in the life and data sciences. For this, visual analysis is indispensable for providing expressive summaries and preparing decision-making. We present the visual analysis approach and software MooViE, which is designed to strike the balance between being tailored to the specific data semantic while being broadly applicable. MooViE supports the data exploration process for extracting important information from the data and captures the result in a fresh single-view visualization. MooViE is implemented in C++ to facilitate fast access and effective interaction with comprehensive multivariate datasets. We showcase the engine for various application fields, relevant to the life sciences. Availability and Implementation: The source code is available under MIT license at https://jugit.fz-juelich.de/IBG-1/ModSim/MooViE and https://github.com/JuBiotechMooViE, with detailed documentation and usage instructions (https://moovie.readthedocs.io), as well as zenodo-archived releases (https://doi.org/10.5281/zenodo.10997388). Platform independent Docker images are also available (jugit-registry.fz-juelich.de/ibg-1/modsim/moovie/moovie). Contact: Katharina Nöh, k.noeh@fz-juelich.de

13
KmerSV: a visualization and annotation tool for structural variants using Human Pangenome derived k-mers

Meng, Q.; Ji, H. P.; Lee, H.

2023-10-15 bioinformatics 10.1101/2023.10.11.561941 medRxiv
Top 0.1%
6.3%

Summary: KmerSV is a visualization and annotation tool for structural variants (SVs). It can be applied to assembly contigs or long-read sequences. Using k-mers it rapidly generates images and provides genome features of SVs. As an important feature, it utilizes the new Human Pangenome reference which provides haploid specific assemblies, addresses limitations in prior references and improves the discovery of SVs. Availability and implementation: KmerSV is implemented in Python and available at github.com/sgtc-stanford/kmerSV

14
Consensus Finder web tool to predict stabilizing substitutions in proteins

Jones, B. J.; Kan, C. N. E.; Luo, C.; Kazlauskas, R.

2020-06-30 bioengineering 10.1101/2020.06.29.178418 medRxiv
Top 0.1%
6.2%

The consensus sequence approach to predicting stabilizing substitutions in proteins rests on the notion that conserved amino acids are more likely to contribute to the stability of a protein fold than non-conserved amino acids. To implement a prediction for a target protein sequence, one finds homologous sequences and aligns them in a multiple sequence alignment. The sequence of the most frequently occurring amino acid at each position is the consensus sequence. Replacement of a rarely occurring amino acid in the target with a frequently occurring amino acid is predicted to be stabilizing. Consensus Finder is an open-source web tool that automates this prediction. This chapter reviews the rationale for the consensus sequence approach and explains the options for fine-tuning this approach using Staphylococcus nuclease A as an example. Competing Interest Statement: The authors have declared no competing interest.
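
The consensus procedure described in the abstract can be sketched directly. This is not Consensus Finder's code: the toy alignment, the target sequence, and the two frequency thresholds (`rare_max`, `common_min`) are hypothetical, and the sketch assumes an ungapped alignment in which every sequence has the same length as the target.

```python
from collections import Counter

def column_frequencies(alignment):
    """Per-column residue frequencies of an ungapped alignment."""
    depth = len(alignment)
    return [
        {residue: count / depth for residue, count in Counter(column).items()}
        for column in zip(*alignment)
    ]

def consensus_sequence(alignment):
    """Most frequent residue at each alignment position."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*alignment))

def suggest_substitutions(target, alignment, rare_max=0.25, common_min=0.6):
    """Propose target -> consensus swaps where the target residue is rare
    and the consensus residue is common (thresholds are illustrative)."""
    frequencies = column_frequencies(alignment)
    consensus = consensus_sequence(alignment)
    suggestions = []
    for position, (residue, column) in enumerate(zip(target, frequencies), start=1):
        best = consensus[position - 1]
        if column.get(residue, 0.0) <= rare_max and column[best] >= common_min:
            suggestions.append(f"{residue}{position}{best}")
    return suggestions

# Hypothetical alignment of homologs; the target is included as one row.
alignment = ["MKVLA", "MKILA", "MKVLA", "MRVLG", "MKVLA"]
target = "MRVLG"

consensus = consensus_sequence(alignment)
suggestions = suggest_substitutions(target, alignment)
print(consensus)
print(suggestions)
```

Substitutions are written in the usual original-position-replacement form, so `R2K` means replacing arginine at position 2 with lysine.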

15
mutSigMapper: an R package to map spectra to mutational signatures based on shot-noise modeling

Candia, J.

2020-10-12 bioinformatics 10.1101/2020.10.12.336404 medRxiv
Top 0.1%
5.0%

Summary: mutSigMapper aims to resolve a critical shortcoming of existing software for mutational signature analysis, namely that of finding parsimonious and biologically plausible exposures. By implementing a shot-noise-based model to generate spectral ensembles, this package addresses this gap and provides a quantitative, non-parametric assessment of statistical significance for the association between mutational signatures and observed spectra. Availability and implementation: The mutSigMapper R package is available under GPLv3 license at https://github.com/juliancandia/mutSigMapper. Its documentation provides additional details and demonstrates applications to biological datasets.

16
tidygenclust: Clustering for Population Genetics in R

Tysall, E. E.; Hovhannisyan, A.; Carter, E. J.; Padilla-Iglesias, C.; Colucci, M.; Pozzi, A. V.; Leonardi, M.; Fatima, A.; Pelanek, O.; Stephenson, N. P.; Manica, A.

2025-07-31 bioinformatics 10.1101/2025.07.29.667403 medRxiv
Top 0.1%
4.9%

Background: Population structure analysis is crucial for evolutionary research and medical genomics. Clustering methods, broadly categorized as model-based (e.g. ADMIXTURE) or non-model-based (e.g. SCOPE), differ in their methodology and computational efficiency. Recently, fastmixture, a model-based approach, has improved scalability and performance, while replicate alignment tools, such as Clumppling, extend previous methods by also aligning the modes across K values. However, all the existing tools are standalone and generate numerous untracked text files, as well as offering limited plot customisability. Results: We introduce an R package, tidygenclust, which brings the functionalities of the original ADMIXTURE, fastmixture and Clumppling software into R, enabling a streamlined and integrated workflow. By integrating with tidypopgen, a package designed to handle large SNP datasets, these new tools maintain metadata, simplify data handling, and produce results as customisable ggplot2 objects for flexible visualisation. Conclusions: The R package tidygenclust advances population genetic analysis by combining computational efficiency with reproducible workflows and user-friendly plotting. The source code and instructions can be accessed on https://github.com/EvolEcolGroup/tidygenclust.

17
piqtree: A Python Package for Seamless Phylogenetic Inference with IQ-TREE

McArthur, R. N.; Wong, T. K. F. N.; Lang, Y.; Morris, R. A.; Caley, K.; Mallawaarachchi, V.; Minh, B. Q.; Huttley, G. A.

2025-07-16 bioinformatics 10.1101/2025.07.13.664626 medRxiv
Top 0.1%
4.9%

piqtree (pronounced pie-cue-tree) is an easy to use, open-source Python package that provides Python script based control of IQ-TREE's phylogenetic inference engine. piqtree builds IQ-TREE as a Python package, presenting a library of Python functions for performing many of IQ-TREE's capabilities including phylogenetic reconstruction, ultrafast bootstrapping, branch length optimization, model selection, rapid neighbor-joining, alignment simulation, and more. As piqtree explicitly uses IQ-TREE's phylogenetic algorithms, the computational and statistical performance of piqtree equals that of IQ-TREE. Modestly higher memory usage may be expected owing to the Python runtime and the need to load the alignment in Python. By exposing IQ-TREE's algorithms within Python, piqtree offers users a greatly simplified experience in development of phylogenetic workflows through seamless interoperability with other Python libraries and tools mediated by the cogent3 package. It enables users to perform interactive phylogenetic analyses and visualization using, for instance, Jupyter notebooks. We present the key features available in the piqtree library and a small case study that showcases its interoperability. piqtree is distributed for use as a standard Python package at https://pypi.org/project/piqtree/, documentation is available at https://piqtree.readthedocs.io and source code at https://github.com/iqtree/piqtree.

18
LoVis4u: Locus Visualisation tool for comparative genomics

Egorov, A. A.; Atkinson, G. C.

2024-09-14 bioinformatics 10.1101/2024.09.11.612399 medRxiv
Top 0.1%
4.9%

Summary: Comparative genomic analysis often involves visualisation of alignments of genomic loci. While several software tools are available for this task, ranging from Python and R libraries to standalone graphical user interfaces, there is a lack of a tool that offers fast, automated usage and the production of publication-ready vector images. Here we present LoVis4u, a command-line tool and Python API designed for highly customizable and fast visualisation of multiple genomic loci. LoVis4u generates vector images in PDF format based on annotation data from GenBank or GFF files. It is capable of visualising entire genomes of bacteriophages as well as plasmids and user-defined regions of longer prokaryotic genomes. Additionally, LoVis4u offers optional data processing steps to identify and highlight accessory and core genes in input sequences. Availability and Implementation: LoVis4u is implemented in Python3 and runs on Linux and MacOS. The command-line interface covers most practical use cases, while the provided Python API allows usage within a Python program, integration into external tools, and additional customisation. Source code is available at the GitHub page: github.com/art-egorov/lovis4u. Detailed documentation that includes an example-driven guide is available from the software home page: art-egorov.github.io/lovis4u.

19
Biosys-LiDeOGraM: A visual analytics framework for interactive modelling of multiscale biosystems

Mejean Perrot, N.; Layec, S.; Tonda, A.; Boukhelifa, N.; Fonseca, F.; Lutton, E.

2023-06-24 bioengineering 10.1101/2023.06.23.546209 medRxiv
Top 0.1%
4.9%

In this paper, we present a test of an interactive modelling scheme in real conditions. The aim is to use this scheme to identify the physiological responses of microorganisms at different scales in a real industrial application context. The originality of the proposed tool, Biosys-LiDeOGraM, is to generate through a human-machine cooperation a consistent and concise model from molecules to microbial population scales: If multi-omics measurements can be connected relatively easily to the response of the biological system at the molecular scale, connecting them to the macroscopic level of the biosystem remains a difficult task, where human knowledge plays a crucial role. The use-case considered here pertains to an engineering process of freeze-drying and storage of Lactic Acid Bacteria. Producing a satisfying model of this process is a challenge due to (i) the scarcity and variability of the experimental dataset, (ii) the complexity and multi-scale nature of biological phenomena, and (iii) the wide knowledge about the biological mechanisms involved in this process. The Biosys-LiDeOGraM tool has two main components that have to be utilized in an iterative manner: the Genomic Interactive Clustering (GIC) module and the Interactive Multi-Scale modellIng Exploration (IMSIE) module, both of which involve users in their learning loops. Applying our approach to a dataset of 2,741 genes, an initial model, as a graph involving 33 variables and 165 equations, was first built. Then the system was able to interactively improve a synthetic version of this model using only 27 variables and 16 equations. The final graph provides a consistent and explainable biological model. This graphical representation allows various user interpretations at local and global scales, an easy confrontation with data, and an exploration of various assumptions. Finally Biosys-LiDeOGraM is easily transferable to other use-cases of multi-scale modelling using functional graphs. 
Author summary: The use of "omics" data for understanding biological systems has become prevalent in several research domains. However, the data generated from diverse macroscopic scales used for this purpose is highly heterogeneous and challenging to integrate. Yet, it is crucial to incorporate this information to gain a comprehensive understanding of the underlying biological system. Although various integrative analysis methods that have been developed provide predictive molecular-scale models, they only offer a mechanistic view of the biological system at the cellular level. In addition, they often focus on specific biological hypotheses through dedicated case studies, making it difficult to apply their results to other scientific problems. To address these issues, we propose an interactive multi-scale modelling approach to integrate cross-scale relationships providing predictive and potentially explanatory models. A proof-of-concept tool has been developed and was validated in the context of the bioproduction of Lactococcus lactis, a bacterial species of high economic interest in the food industry and for which the control of the bioprocess is essential to guarantee its viability and functionality. Our approach can be applied to any biological system that can be defined through a set of variables, constraints and scales.

20
BioViz Connect: Web application linking CyVerse cloud resources to genomic visualization in the Integrated Genome Browser

Raveendran, K.; Kintali, C.; Tiwari, S.; Bole, P.; Freese, N. H.; Loraine, A.

2020-05-16 bioinformatics 10.1101/2020.05.15.098533 medRxiv
Top 0.1%
4.9%

Genomics researchers do better work when they can interactively explore and visualize data. However, due to the vast size of experimental datasets, researchers are increasingly using powerful, cloud-based systems to process and analyze data. These remote systems, called science gateways, offer user-friendly, Web-based access to high-performance computing and storage resources, but typically lack interactive visualization capability. In this paper, we present BioViz Connect, a middleware Web application that links the CyVerse science gateway to the Integrated Genome Browser (IGB), a highly interactive native application implemented in Java that runs on the user's personal computer. Using BioViz Connect, users can (i) stream data from the CyVerse data store into IGB for visualization, (ii) improve the IGB user experience for themselves and others by adding IGB-specific metadata to CyVerse data files, including genome version and track appearance, and (iii) run compute-intensive visual analytics functions on CyVerse infrastructure to create new datasets for visualization in IGB or other applications. To demonstrate how BioViz Connect facilitates interactive data visualization, we describe an example RNA-Seq data analysis investigating how heat and desiccation stresses affect gene expression in the model plant Arabidopsis thaliana. Lastly, we discuss limitations of the technologies used and suggest opportunities for future work. BioViz Connect is available from https://bioviz.org.